A methodology for document processing: separating text from images
Author
Abstract
This paper presents a methodology for document processing that separates text paragraphs from images. The methodology is based on the recognition of text characters and words, which allows text paragraphs to be separated efficiently from images while keeping their relationships, so that the original page can be reconstructed. Text separation and extraction are based on a hierarchical framing process. The process starts with the framing of a single character after its recognition, continues with the recognition and framing of a word, and ends with the framing of all text lines. The text lines form a natural language text which requires analysis. © 2001 Published by Elsevier Science Ltd.
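The hierarchical framing idea (character frames merged into word frames, then into text-line frames) can be illustrated with a minimal sketch. The abstract does not give the paper's actual recognition or framing rules, so everything here is an assumption: character frames are taken as already recognized and supplied in reading order, and the names `Frame`, `frame_words`, `frame_lines`, and the `max_gap` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int


def union(a: Frame, b: Frame) -> Frame:
    """Smallest frame that contains both a and b."""
    return Frame(min(a.left, b.left), min(a.top, b.top),
                 max(a.right, b.right), max(a.bottom, b.bottom))


def overlaps_vertically(a: Frame, b: Frame) -> bool:
    """True if the two frames share some vertical extent (same text row)."""
    return min(a.bottom, b.bottom) > max(a.top, b.top)


def merge_consecutive(frames, belongs_together):
    """Merge each frame into the previous group while the predicate holds."""
    merged = []
    for frame in frames:
        if merged and belongs_together(merged[-1], frame):
            merged[-1] = union(merged[-1], frame)
        else:
            merged.append(frame)
    return merged


def frame_words(char_frames, max_gap=4):
    """Character frames -> word frames. Assumes reading order; max_gap is an
    assumed horizontal spacing threshold, not a value from the paper."""
    return merge_consecutive(
        char_frames,
        lambda prev, cur: overlaps_vertically(prev, cur)
        and cur.left - prev.right <= max_gap,
    )


def frame_lines(word_frames):
    """Word frames -> text-line frames, grouping words on the same row."""
    return merge_consecutive(word_frames, overlaps_vertically)


# Two rows of characters: "ab cd" on the first row, "ef" on the second.
chars = [
    Frame(0, 0, 5, 10), Frame(6, 0, 11, 10),
    Frame(20, 0, 25, 10), Frame(26, 0, 31, 10),
    Frame(0, 15, 5, 25), Frame(6, 15, 11, 25),
]
words = frame_words(chars)
lines = frame_lines(words)
```

Each level keeps only bounding frames, so the position of every text line relative to the rest of the page is preserved, which is the kind of information a page reconstruction step would need.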
Similar resources
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method that compensates for undesirable geometric distortions in order to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
Removing Geometric Distortion from Text Documents Using the Geometric Information of Text Lines
Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from the document image using OCR is subject to errors. In this paper we propose a novel approach that significantly reduces geometric distortion in document images. In this method, we first extract document lines from do...
A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
Complex Background and Foreground Extraction in Color Document Images using Interval Type-2 Fuzzy
This paper deals with the problem of extracting text information from complex backgrounds in color document images. Developing a general framework for separating the foreground text from the background information in a complex document image is still a challenging problem because of its high unpredictability and complexity. In this paper a new interval type-2 fuzzy based thresholding method is propos...
Separating text and background in degraded document images - a comparison of global thresholding techniques for multi-stage thresholding
Before any processing of the textual content of a document image can be performed, the text must be separated from the background of the image. Several thresholding algorithms have previously been proposed and are widely used in document processing. None has been shown to be effective at thresholding difficult documents where the background and foreground are non-uniform. In this paper we investigate...